0. Introduction

Obesity is a major risk factor for a number of chronic diseases, including diabetes mellitus, cardiovascular diseases, and cancer. According to the World Health Organization, it has become a health challenge of epidemic proportions over the last few decades in western and westernized societies.

Although widely criticized, the principal method for determining obesity in adults is the Body Mass Index (BMI). The BMI classification (from the CDC) is as follows:

Category    Underweight       Healthy         Overweight      Obese
BMI         Less than 18.5    18.5 to 24.9    25.0 to 29.9    30.0 or more

In this project, I will analyze the relationship between nutritional indicators (BMI and blood cholesterol levels) in populations worldwide and certain types of cancer, using data available on the Gapminder website (see http://www.gapminder.org/data/).

1. Worldwide BMI Exploration

I’ll start by exploring how the BMI dataset is structured and distributed.

1.1. Dataset structure

## 'data.frame':    199 obs. of  30 variables:
##  $ Country: chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ X1980  : num  21.5 25.2 22.3 25.7 20.9 ...
##  $ X1981  : num  21.5 25.2 22.3 25.7 20.9 ...
##  $ X1982  : num  21.5 25.3 22.4 25.7 20.9 ...
##  $ X1983  : num  21.4 25.3 22.5 25.8 20.9 ...
##  $ X1984  : num  21.4 25.3 22.6 25.8 20.9 ...
##  $ X1985  : num  21.4 25.3 22.7 25.9 20.9 ...
##  $ X1986  : num  21.4 25.3 22.8 25.9 21 ...
##  $ X1987  : num  21.4 25.3 22.8 25.9 21 ...
##  $ X1988  : num  21.3 25.3 22.9 26 21 ...
##  $ X1989  : num  21.3 25.3 23 26 21.1 ...
##  $ X1990  : num  21.2 25.3 23 26.1 21.1 ...
##  $ X1991  : num  21.2 25.3 23.1 26.2 21.1 ...
##  $ X1992  : num  21.1 25.2 23.2 26.2 21.1 ...
##  $ X1993  : num  21.1 25.2 23.3 26.3 21.1 ...
##  $ X1994  : num  21 25.2 23.3 26.4 21.1 ...
##  $ X1995  : num  20.9 25.3 23.4 26.4 21.2 ...
##  $ X1996  : num  20.9 25.3 23.5 26.5 21.2 ...
##  $ X1997  : num  20.8 25.3 23.5 26.6 21.2 ...
##  $ X1998  : num  20.8 25.4 23.6 26.7 21.3 ...
##  $ X1999  : num  20.8 25.5 23.7 26.8 21.3 ...
##  $ X2000  : num  20.7 25.6 23.8 26.8 21.4 ...
##  $ X2001  : num  20.6 25.7 23.9 26.9 21.4 ...
##  $ X2002  : num  20.6 25.8 24 27 21.5 ...
##  $ X2003  : num  20.6 25.9 24.1 27.1 21.6 ...
##  $ X2004  : num  20.6 26 24.2 27.2 21.7 ...
##  $ X2005  : num  20.6 26.1 24.3 27.3 21.8 ...
##  $ X2006  : num  20.6 26.2 24.4 27.4 21.9 ...
##  $ X2007  : num  20.6 26.3 24.5 27.5 22.1 ...
##  $ X2008  : num  20.6 26.4 24.6 27.6 22.3 ...

The dataset is in wide format and contains data for 199 countries. The first column holds the country name, and the remaining 29 columns hold the estimated average BMI in that country for each year from 1980 to 2008. Male and female data come in separate datasets.

1.2. Dataset distribution

Let’s get a feel for the data by looking at how BMI is distributed worldwide at the most recent time point available (2008), for males and females.

summary(maleBMI$X2008)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.87   22.83   25.50   25.10   26.82   33.90
summary(femaleBMI$X2008)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.55   23.73   25.99   25.92   27.49   35.02
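The two summaries above compare the distributions as a whole. A complementary check is to pair the values country by country and look at the female-minus-male difference. The following is only a sketch: `toyMale` and `toyFemale` are made-up stand-ins with the same `Country`/`X2008` layout, not the Gapminder data.

```r
# Toy stand-ins for maleBMI and femaleBMI (values invented for illustration)
toyMale <- data.frame(Country = c('A', 'B', 'C', 'D'),
                      X2008 = c(21.0, 25.5, 26.8, 23.1),
                      stringsAsFactors = FALSE)
toyFemale <- data.frame(Country = c('A', 'B', 'C', 'D'),
                        X2008 = c(21.9, 25.2, 28.0, 24.0),
                        stringsAsFactors = FALSE)

# Align the two tables by country before subtracting
paired <- merge(toyMale, toyFemale, by = 'Country', suffixes = c('.M', '.F'))
paired$Difference <- paired$X2008.F - paired$X2008.M

# Share of countries where the female mean BMI is the higher one
shareFemaleHigher <- mean(paired$Difference > 0)
summary(paired$Difference)
```

With the real data, `merge()` would also drop any country present in only one of the two tables, which keeps the comparison fair.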

It seems that females have a slightly higher BMI than males. Let’s plot the values as histograms.

The distributions are very roughly normal, although the discreteness and the low number of points distort this pattern. There are a few more BMI values above 28 in the female plot. Boxplots should make this more evident.

The boxplots confirm visually that women generally have a slightly higher BMI than men in this dataset. I wonder whether this disparity has always existed. To find out, I can study how BMI has evolved from 1980 to 2008.

However, to do so I’ll have to convert the data from wide format to long format.

1.3. BMI data reshape

# melt() below is assumed to come from the reshape2 package
library(reshape2)

# Add the gender column that we'll use later
maleBMI$Gender <- 'M'
femaleBMI$Gender <- 'F'

# Convert the data from wide to long format
maleBMI_long <- melt(maleBMI,
                     id.vars = c('Country', 'Gender'),
                     variable.name = 'Year',
                     value.name = 'BMI')
femaleBMI_long <- melt(femaleBMI,
                     id.vars = c('Country', 'Gender'),
                     variable.name = 'Year',
                     value.name = 'BMI')

# And merge male and female data.
healthData <- rbind(maleBMI_long, femaleBMI_long)

# Little quality of life improvements
healthData$Gender <- factor(healthData$Gender)
healthData$Year <- factor(healthData$Year, 
                          labels = c(1980:2008))

Let’s see what the dataset structure looks like now.

## 'data.frame':    11542 obs. of  4 variables:
##  $ Country: chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ Gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Year   : Factor w/ 29 levels "1980","1981",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BMI    : num  21.5 25.2 22.3 25.7 20.9 ...

The new dataset contains 11542 observations (199 countries × 2 genders × 29 time points) of 4 variables: Country, Gender, Year, and the BMI value itself.

1.4. BMI evolution in the last decades

I can now use a scatterplot to see how BMI has evolved worldwide from 1980 to 2008.

It seems that worldwide BMI is increasing. Is the BMI of men and women increasing at the same rate?

From this data, we can conclude that from 1980 onwards the female BMI has been higher than the male BMI, and that both are increasing at a similar rate worldwide.
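To put a number on "a similar rate", one option is to fit a straight line to the yearly mean BMI of each gender and compare the slopes. This is a minimal sketch on invented data: the `toy` data frame only mimics the columns of `healthData`, and the linear trend is an assumption.

```r
# Toy long-format data (invented values) with healthData-like columns
toy <- data.frame(
  Country = rep(c('A', 'B'), each = 6),
  Gender  = factor(rep(rep(c('F', 'M'), each = 3), times = 2)),
  Year    = factor(rep(c(1980, 1994, 2008), times = 4)),
  BMI     = c(24.0, 24.7, 25.4,  23.0, 23.7, 24.4,
              26.0, 26.7, 27.4,  25.0, 25.7, 26.4)
)

# Year is a factor after the reshape, so convert it back to a number first
toy$YearNum <- as.numeric(as.character(toy$Year))

# Mean BMI per year and gender across countries
yearlyMeans <- aggregate(BMI ~ YearNum + Gender, data = toy, FUN = mean)

# Slope of a linear fit = average BMI change per year, for each gender
slopes <- sapply(split(yearlyMeans, yearlyMeans$Gender),
                 function(d) coef(lm(BMI ~ YearNum, data = d))[['YearNum']])
slopes
```

Note the `as.numeric(as.character(...))` step: calling `as.numeric()` directly on a factor would return the level indices (1, 2, 3, ...) rather than the years.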

1.5. Categorical BMI classification

This data is worrying, because a lot of observations are above the BMI threshold for overweight, and a few are above the obesity threshold. Since each observation is a country average, this means that in some countries the average BMI itself falls in the obese range. I’ll add a category to the data to make this fact obvious.

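# right = FALSE makes the intervals [18.5, 25), [25, 30), ... so that boundary
# values fall into the higher category, as in the CDC table
healthData$BMIClassification <- cut(healthData$BMI, 
  breaks = c(0,18.5,25,30,50),
  labels = c('Underweight', 'Healthy', 'Overweight', 'Obese'),
  right = FALSE)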

And plot the data using different colors for this new classification.

I wonder which countries have an average BMI above the obesity threshold.

##  [1] "Cook Islands"          "French Polynesia"     
##  [3] "Nauru"                 "Palau"                
##  [5] "Samoa"                 "Tonga"                
##  [7] "Bermuda"               "Egypt"                
##  [9] "Kiribati"              "Kuwait"               
## [11] "Marshall Islands"      "Micronesia, Fed. Sts."
## [13] "Puerto Rico"           "Saint Kitts and Nevis"

The results are very interesting, because they are counter-intuitive if you are not familiar with the data. However, they are consistent with the current obesity problem in the Pacific countries (https://en.wikipedia.org/wiki/Obesity_in_the_Pacific).

This data would benefit a lot from being plotted on a world map.

Some countries are not displayed correctly, because their names in the dataset do not match the names used by the map. Let’s see which ones they are.

# Determine which names are not matched to the map regions. 
# From the data to the map.
unique(healthData$Country)[!unique(healthData$Country) %in% 
                             unique(worldMap$region)]
##  [1] "Antigua and Barbuda"              "Armenia"                         
##  [3] "Azerbaijan"                       "Belarus"                         
##  [5] "Bermuda"                          "Bosnia and Herzegovina"          
##  [7] "British Virgin Islands"           "Central African Rep."            
##  [9] "Congo, Dem. Rep."                 "Congo, Rep."                     
## [11] "Cote d'Ivoire"                    "Croatia"                         
## [13] "Czech Rep."                       "Dominican Rep."                  
## [15] "Eritrea"                          "Estonia"                         
## [17] "Georgia"                          "Hong Kong, China"                
## [19] "Kazakhstan"                       "Korea, Dem. Rep."                
## [21] "Korea, Rep."                      "Kyrgyzstan"                      
## [23] "Latvia"                           "Lithuania"                       
## [25] "Macao, China"                     "Macedonia, FYR"                  
## [27] "Micronesia, Fed. Sts."            "Moldova"                         
## [29] "Montenegro"                       "Netherlands Antilles"            
## [31] "Palau"                            "Russia"                          
## [33] "Saint Kitts and Nevis"            "Saint Vincent and the Grenadines"
## [35] "Serbia"                           "Singapore"                       
## [37] "Slovak Republic"                  "Slovenia"                        
## [39] "Taiwan"                           "Tajikistan"                      
## [41] "Timor-Leste"                      "Trinidad and Tobago"             
## [43] "Turkmenistan"                     "Ukraine"                         
## [45] "United Kingdom"                   "United States"                   
## [47] "Uzbekistan"                       "West Bank and Gaza"              
## [49] "Yemen, Rep."
# And from the map to the data.
unique(worldMap$region)[!unique(worldMap$region) %in% 
                          unique(healthData$Country)]
##  [1] "Great Lakes"              "USSR"                    
##  [3] "Aral Sea"                 "Caspian Sea"             
##  [5] "Lake Malawi"              "USA"                     
##  [7] "French Guiana"            "Lake Titicaca"           
##  [9] "North Korea"              "Czechoslovakia"          
## [11] "South Korea"              "Black Sea"               
## [13] "Lake Pasvikelv"           "Yugoslavia"              
## [15] "Lacul Greaca"             "Liechtenstein"           
## [17] "UK"                       "Monaco"                  
## [19] "Western Sahara"           "Ivory Coast"             
## [21] "Central African Republic" "Zaire"                   
## [23] "Congo"                    "Gaza Strip"              
## [25] "Yemen"                    "Neutral Zone"            
## [27] "Vislinskiy Zaliv"         "Lake Albert"             
## [29] "Lake Tanganyika"          "Lake Kariba"             
## [31] "Lake Victoria"            "Great Bitter Lake"       
## [33] "West Bank"                "Micronesia"              
## [35] "Antarctica"               "Hawaii"                  
## [37] "Saint-Barthelemy"         "South Sandwich Islands"  
## [39] "Guadeloupe"               "Lake Fjerritslev"        
## [41] "Anguilla"                 "Saint Kitts"             
## [43] "Montserrat"               "Dominican Republic"      
## [45] "Sardinia"                 "Sicily"                  
## [47] "Sonsorol Island"          "New Caledonia"           
## [49] "Turks and Caicos"         "Tokelau"                 
## [51] "Maug Island"              "Pitcairn Islands"        
## [53] "Isle of Man"              "Saint Eustatius"         
## [55] "California"               "Andaman Islands"         
## [57] "Northern Mariana Islands" "Nevis"                   
## [59] "Madeira Islands"          "San Marino"              
## [61] "Sin Cowe Island"          "Tuvalu"                  
## [63] "Paracel Islands"          "Tobago"                  
## [65] "Azores"                   "Falkland Islands"        
## [67] "American Samoa"           "Cayman Islands"          
## [69] "Virgin Islands"           "Canary Islands"          
## [71] "Barbuda"                  "Trinidad"                
## [73] "Chagos Archipelago"       "Saint Vincent"           
## [75] "Spratly Island"           "Wales"                   
## [77] "Antigua"                  "Bonaire"                 
## [79] "Aruba"                    "Martinique"              
## [81] "Irian Jaya"               "Isle of Wight"           
## [83] "Saint-Martin"             "Curacao"
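The two comparisons above can also be written with `setdiff()`, which reads a little more directly. The sketch below uses tiny made-up name vectors, and the `lookup` vector is a hypothetical illustration of how a single mismatch could be reconciled by hand.

```r
# Toy name vectors (a few real-looking names, chosen only for illustration)
dataNames <- c('United States', 'Czech Rep.', 'Spain')
mapNames  <- c('USA', 'Czechoslovakia', 'Spain')

# Names in the data that the map does not know, and the other way round
setdiff(dataNames, mapNames)
setdiff(mapNames, dataNames)

# Hypothetical manual lookup: data name -> map name
lookup <- c('United States' = 'USA')
recoded <- unname(ifelse(dataNames %in% names(lookup),
                         lookup[dataNames],
                         dataNames))

# Only the names without a lookup entry remain unmatched
stillUnmatched <- setdiff(recoded, mapNames)
stillUnmatched
```

Maintaining such a lookup by hand quickly becomes tedious when dozens of names differ, which is a good reason to look for standard country codes instead.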

So the problem seems to be that the region names used by geom_map are obsolete (‘USSR’, ‘Czechoslovakia’…), that ‘Republic’ is abbreviated to ‘Rep.’ in our dataset, and other small discrepancies. After some research, I found that the rworldmap package should be more accurate and easier to use.

I will also add one more step and convert the country names to ISO3 country codes with the countrycode package, in order to minimize mismatches.

# Create the country codes
healthData$CountryCode <- factor(countrycode(healthData$Country,
                                             'country.name', 
                                             'iso3c'))

# Join the data with the rworldmap map
mapBMI <- joinCountryData2Map(healthData,
                              joinCode = 'ISO3',
                              nameJoinColumn = 'CountryCode')
## 11484 codes from your data successfully matched countries in the map
## 58 codes from your data failed to match with a country code in the map
## 46 codes from the map weren't represented in your data

Apparently, one country code (totalling 58 observations) failed to match with a region in the map. In any case, that’s much better than what we had before.

That’s better. I want to zoom in on the Pacific area to visualize the countries with the highest BMI values.

Unfortunately, the countries are too small to see them clearly.

Two questions arise at this stage:

  1. Are those changes in BMI a consequence of a poor diet?

and

  2. How is this higher BMI correlated with cancer?

2. Is this increase in global BMI a consequence of the diet?

In short, is this BMI increase an effect of fast food and poor diets? Fast food has been shown to contribute to high blood cholesterol levels. Let’s analyze how cholesterol levels are evolving worldwide.

2.1. Cholesterol Exploration

Exactly as I did with the BMI, I’ll first ask: how are cholesterol levels distributed worldwide?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.780   4.339   4.659   4.676   5.047   5.677
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.974   4.532   4.817   4.792   5.124   5.674

Again, it seems that females have higher cholesterol than males.

2.2. Cholesterol Data Reshape

In order to study the evolution of blood cholesterol levels and integrate this data with the BMI data, I have to convert it from wide to long format.

# Add the gender column that we'll use later
maleCholesterol$Gender <- 'M'
femaleCholesterol$Gender <- 'F'

# Convert the data to long format
maleCholesterol_long <- melt(maleCholesterol,
                     id.vars = c('Country', 'Gender'),
                     variable.name = 'Year',
                     value.name = 'Cholesterol')
femaleCholesterol_long <- melt(femaleCholesterol,
                     id.vars = c('Country', 'Gender'),
                     variable.name = 'Year',
                     value.name = 'Cholesterol')

# Group Male and Female Data
cholesterolData <- rbind(maleCholesterol_long, femaleCholesterol_long)
# Build the factors from cholesterolData's own columns, not healthData's
cholesterolData$Gender <- factor(cholesterolData$Gender)
cholesterolData$Year <- factor(cholesterolData$Year, 
                          labels = c(1980:2008))

# And finally, integrate! merge() joins on the shared columns
# (Country, Gender and Year)
healthData <- merge(healthData, cholesterolData)

2.3. Cholesterol evolution in the last decades

Is the blood cholesterol concentration behaving in the same way as the BMI?

Interesting! Although the worldwide BMI is increasing, blood cholesterol levels are decreasing! I wasn’t expecting this at all. Nevertheless, I’ll continue with the analysis of the correlation between both parameters.

2.4. BMI & Cholesterol Analysis

Let’s view it from a BMI classification point of view.

Now, that makes a little more sense. Blood cholesterol levels are decreasing in healthy and overweight populations, but stay steady in obese populations!

Moreover, it seems that although female cholesterol levels are higher worldwide, males have higher cholesterol levels in overweight and obese populations.

Taking these differences into account, is BMI a good indicator of blood cholesterol levels?

maleHealthData2008 <- subset(healthData, Year == 2008 & Gender == 'M')
femaleHealthData2008 <- subset(healthData, Year == 2008 & Gender == 'F')
cor(maleHealthData2008$BMI,maleHealthData2008$Cholesterol)
## [1] 0.714079
cor(femaleHealthData2008$BMI,femaleHealthData2008$Cholesterol)
## [1] 0.5141112
cor(subset(healthData, Year == 2008)$BMI,
    subset(healthData, Year == 2008)$Cholesterol)
## [1] 0.621037

As expected, there is a substantial correlation between BMI and blood cholesterol levels, especially in males! So, although blood cholesterol levels are decreasing worldwide, they appear to be clearly correlated with BMI, which is increasing worldwide.
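A single coefficient says little about how certain the estimate is. As a sketch, base R’s `cor.test()` returns the same Pearson coefficient together with a confidence interval. The vectors below are invented toy values, not the 2008 data.

```r
# Toy values standing in for the 2008 BMI and cholesterol columns
toyBMI  <- c(21.0, 22.5, 24.0, 25.5, 27.0, 28.5, 30.0)
toyChol <- c(4.1, 4.4, 4.3, 4.8, 4.9, 5.2, 5.1)

# Pearson correlation with a 95% confidence interval
result <- cor.test(toyBMI, toyChol)
result$estimate   # same value as cor(toyBMI, toyChol)
result$conf.int   # lower and upper bound of the interval
```

With roughly 199 countries per gender the interval would be much narrower than with these seven toy points, but it still helps to judge whether the male and female coefficients really differ.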

At this stage, I realize that this data may be ‘biased’ by the nutritional differences between highly developed and less developed countries: malnutrition occurs in the latter and, in consequence, lower cholesterol levels are expected there. In contrast, fast food is more typical of “westernized” countries.

2.5. Country classification

To take into account the aforementioned differences in country development, I’ll use the data of GDP(PPP) per country. I decided to use this data because of the following statement:

The measure that most economists prefer, therefore, is GDP (PPP) [“GDP based on purchasing power parity”] per capita. GDP (PPP) per capita compares generalized differences in living standards on the whole between nations because PPP takes into account the relative cost of living and the inflation rates of countries, rather than using just exchange rates, which may distort the real differences in income.

Source

I decided to arbitrarily define as “rich” the countries in the top 25% of GDP (PPP), and as “poor” the countries in the bottom 25%.
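The quartile-based classification can be sketched with `quantile()` and `cut()`. The example below uses invented GDP values (`toyGDP`) and also shows a detail worth knowing: by default `cut()` builds intervals that are open on the left, so the observation equal to the lowest break gets `NA` unless `include.lowest = TRUE` is set.

```r
# Toy GDP (PPP) per capita values, invented for illustration
toyGDP <- c(500, 1200, 3000, 8000, 15000, 22000, 35000, 60000)

# Breaks at the minimum, first quartile, third quartile and maximum
gdpBreaks <- c(min(toyGDP),
               quantile(toyGDP, 0.25),
               quantile(toyGDP, 0.75),
               max(toyGDP))
gdpLabels <- c('Poor', 'Middle', 'Rich')

# Default behaviour: the minimum value falls outside the first interval
withoutLowest <- cut(toyGDP, breaks = gdpBreaks, labels = gdpLabels)
sum(is.na(withoutLowest))

# include.lowest = TRUE closes the first interval on the left
withLowest <- cut(toyGDP, breaks = gdpBreaks, labels = gdpLabels,
                  include.lowest = TRUE)
table(withLowest)
```

Here the first call leaves one value unclassified, while the second assigns every value to one of the three groups.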

Let’s visualize the GDP distribution in a map for the 2008 data.

The poorest countries are found in Africa.

I’ll add the categorical classification for the GDP and integrate the data with the healthData from the year 2008.

# Get the 2008 data and rename the columns
GDP2008 <- CountryGDP[c('Country', 'X2008')]
names(GDP2008) <- c('Country', 'GDP')

# Add the categorical classification
GDP2008$GDPClassification <- cut(GDP2008$GDP, 
  breaks = c(min(GDP2008$GDP), 
            quantile(GDP2008$GDP, 0.25), 
            quantile(GDP2008$GDP, 0.75), 
            max(GDP2008$GDP)),
  labels = c('Poor', 'Middle', 'Rich'),
  # Without include.lowest, the country with the minimum GDP would get NA
  include.lowest = TRUE)

# The two datasets do not cover the same countries, so keep only the
# countries that have GDP data...
countriesWithBMI <- GDP2008$Country

# ... And I have to keep male and female BMI separated.
maleHealthData2008 <- subset(subset(healthData, 
                                    Gender == 'M' & Year == '2008'), 
                             Country %in% countriesWithBMI)
femaleHealthData2008 <- subset(subset(healthData, 
                                      Gender == 'F' & Year == '2008'), 
                               Country %in% countriesWithBMI)

maleHealthData2008 <- merge(GDP2008, maleHealthData2008, all=FALSE)
femaleHealthData2008 <- merge(GDP2008, femaleHealthData2008, all=FALSE)

# The final 2008 data frame
healthData2008 <- rbind(maleHealthData2008, femaleHealthData2008)

2.5.1 Country GDP & Cholesterol

Let’s do a simple scatterplot of Cholesterol levels vs GDP (PPP) factoring the Gender in.

This seems to be an exponential relationship. Let’s alter the scale.

Yeah! But boxplots would suit this type of graph better. I’ll also mark the line between healthy and borderline-high cholesterol levels (Source).

Let’s quantify that relationship.

## [1] 0.6987328

So we can conclude that blood cholesterol levels are highly correlated with country development. This fact may affect the relationship between cholesterol and BMI levels that we found before.
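The code behind the coefficient above is not shown. Given that the relationship looked exponential on the raw scale, one plausible way to quantify it is to correlate cholesterol with the logarithm of GDP. That choice is an assumption of this sketch, and the values are invented.

```r
# Toy GDP (PPP) per capita and cholesterol values, invented for illustration
toyGDP  <- c(800, 2000, 5000, 12000, 30000, 55000)
toyChol <- c(4.0, 4.3, 4.6, 4.9, 5.2, 5.4)

# Pearson correlation on the raw and on the log10 scale
rawR <- cor(toyGDP, toyChol)
logR <- cor(log10(toyGDP), toyChol)

c(raw = rawR, log10 = logR)
```

When cholesterol grows roughly linearly with the logarithm of GDP, as in these toy values, the coefficient on the log scale is the higher of the two, because Pearson correlation only measures linear association.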

2.5.2 Country GDP & BMI

I wonder if the estimated country BMI is equally associated with its country development level.

Although poor countries tend to have lower BMI, the difference between middle and rich countries is not clear.

2.5.3. Country GDP, BMI & Cholesterol levels

Now it’s time to take the three parameters into account. In this context, it means that I will only consider the data from “developed” countries and ignore the data from the “poorest” countries, because I consider it unreliable (for the nutritional reasons explained above, and the medical reasons that I will explain below).

As I’m going to subset by this parameter in almost every analysis from here onwards, I decided to create a simple function to do it more easily. I’ll write equally simple functions for the male and female subsets.

# Returns the dataframe without the rows whose GDPClassification is
# 'Poor' or NA.
nonPoor <- function(dataframe) {
  return(subset(dataframe, 
         GDPClassification == 'Middle' | GDPClassification == 'Rich'))
}

# Returns only the male rows of the dataframe (Gender == 'M').
males <- function(dataframe) {
  return(subset(dataframe, Gender == 'M'))
}
# Returns only the female rows of the dataframe (Gender == 'F').
females <- function(dataframe) {
  return(subset(dataframe, Gender == 'F'))
}

When we ignore the data from the “poorest” countries, the correlation between BMI and cholesterol disappears.
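A correlation that vanishes once a subgroup is removed is typical of pooled data in which the groups differ in both variables. The sketch below uses values invented specifically to show that mechanism. It does not reproduce the actual result, and the `toy` data frame is hypothetical.

```r
# Toy data: poor countries are low on both variables, while BMI and
# cholesterol are almost unrelated within the other two groups
toy <- data.frame(
  GDPClassification = rep(c('Poor', 'Middle', 'Rich'), each = 3),
  BMI         = c(20.5, 21.0, 22.0, 25.0, 26.0, 27.0, 25.5, 26.5, 27.5),
  Cholesterol = c(4.0, 4.1, 4.3, 5.0, 4.9, 5.1, 5.2, 5.0, 5.1),
  stringsAsFactors = FALSE
)

# Correlation over all countries
overallR <- cor(toy$BMI, toy$Cholesterol)

# Correlation after dropping the poor group
nonPoorToy <- subset(toy, GDPClassification %in% c('Middle', 'Rich'))
nonPoorR <- cor(nonPoorToy$BMI, nonPoorToy$Cholesterol)

c(overall = overallR, nonPoor = nonPoorR)
```

In this toy example most of the pooled correlation comes from the gap between the poor group and the rest, not from any relationship inside the groups.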

3. Cancer

Finally, let’s integrate what we have with cancer incidence data. Let’s start by taking a look at the cancer data and its distribution.

3.1. Cancer data structure

## 'data.frame':    173 obs. of  12 variables:
##  $ Country       : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ Breast.Cancer : num  26.8 57.4 23.5 23.1 73.9 51.6 83.2 70.5 31.5 54.4 ...
##  $ Cervix        : num  6.9 25.2 15.6 28.6 23.2 16.8 6.9 10.9 8.2 16.7 ...
##  $ Colon.Female  : num  4.5 19.3 5 3 19.1 7.9 35.9 27.8 4.3 14.7 ...
##  $ Colon.Male    : num  5.2 28.1 5.5 4.3 30.1 8.5 47.4 42.1 5.7 15.2 ...
##  $ Liver.Female  : num  2.5 3 1 3.7 1.9 2.6 1.3 2.9 1.9 2 ...
##  $ Liver.Male    : num  3.7 5.3 0.8 5.3 3.5 4.2 3.8 7.8 3.3 3.7 ...
##  $ Lung.Female   : num  2.9 13.2 2 1.3 8 7.5 16.8 14.3 6.1 5.1 ...
##  $ Lung.Male     : num  12.2 58.9 16.9 7.2 43.3 58.9 39.5 42.6 33 21.2 ...
##  $ Prostate.Male : num  4.5 15.1 5.6 12.7 36.8 10.2 76 71.4 7.5 65.3 ...
##  $ Stomach.Female: num  9.7 6.9 3.1 9.7 6 9.9 4.1 8.6 15.6 8.1 ...
##  $ Stomach.Male  : num  18.5 16.5 5.9 14.1 14.6 21.7 9.8 13.5 36 16.5 ...

The dataset contains the number of cancer cases per 100,000 inhabitants for different cancer types in a number of countries in the year 2002. Consequently, I will use the 2002 country data for the next analyses.
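For reference, a rate per 100,000 inhabitants is simply the number of cases divided by the population and scaled by 100,000. The helper below, `ratePer100k`, is a hypothetical illustration with invented numbers. Whether the Gapminder figures are additionally age-standardized is not stated here, so this sketch shows only the crude rate.

```r
# Crude incidence rate per 100,000 inhabitants
ratePer100k <- function(cases, population) {
  cases / population * 1e5
}

# Invented example: 268 cases in a population of one million
exampleRate <- ratePer100k(cases = 268, population = 1e6)
exampleRate
```

Expressing incidence as a rate is what makes countries of very different sizes comparable in the analyses that follow.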

3.2. Cancer data distribution

##    Country          Breast.Cancer       Cervix       Colon.Female  
##  Length:173         Min.   :  3.9   Min.   : 2.00   Min.   : 0.90  
##  Class :character   1st Qu.: 20.6   1st Qu.:10.80   1st Qu.: 3.80  
##  Mode  :character   Median : 30.0   Median :20.30   Median : 8.30  
##                     Mean   : 37.4   Mean   :23.19   Mean   :11.98  
##                     3rd Qu.: 50.3   3rd Qu.:30.40   3rd Qu.:18.00  
##                     Max.   :101.1   Max.   :87.30   Max.   :42.20  
##    Colon.Male     Liver.Female      Liver.Male     Lung.Female    
##  Min.   : 1.00   Min.   : 0.200   Min.   : 0.80   Min.   : 0.100  
##  1st Qu.: 5.10   1st Qu.: 1.900   1st Qu.: 3.80   1st Qu.: 2.200  
##  Median :10.30   Median : 3.100   Median : 6.10   Median : 5.500  
##  Mean   :16.13   Mean   : 4.982   Mean   :11.55   Mean   : 6.932  
##  3rd Qu.:25.60   3rd Qu.: 5.600   3rd Qu.:15.30   3rd Qu.:10.100  
##  Max.   :58.50   Max.   :57.300   Max.   :98.90   Max.   :36.100  
##    Lung.Male     Prostate.Male    Stomach.Female    Stomach.Male  
##  Min.   : 0.50   Min.   :  0.30   Min.   : 0.600   Min.   : 0.60  
##  1st Qu.: 7.50   1st Qu.:  8.40   1st Qu.: 3.700   1st Qu.: 6.30  
##  Median :19.50   Median : 19.30   Median : 6.300   Median :12.30  
##  Mean   :25.75   Mean   : 26.52   Mean   : 7.965   Mean   :14.79  
##  3rd Qu.:41.10   3rd Qu.: 36.40   3rd Qu.: 9.900   3rd Qu.:19.80  
##  Max.   :94.60   Max.   :124.80   Max.   :30.600   Max.   :69.70

The distribution of most cancer incidences appears to be somewhat long-tailed. Again, I worry that a country’s development and, in consequence, its medical advancement affect the number of cancers detected. For this reason, I’ll take the country GDP into account in the next analyses too.
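The long right tail can be read straight from the summary table: in a right-skewed distribution the mean is pulled above the median. The sketch below copies the means and medians from the output above and computes their ratio. The variable names are illustrative.

```r
# Means and medians copied from the summary output above
cancerMeans <- c(Breast.Cancer = 37.4, Cervix = 23.19, Colon.Female = 11.98,
                 Colon.Male = 16.13, Liver.Female = 4.982, Liver.Male = 11.55,
                 Lung.Female = 6.932, Lung.Male = 25.75, Prostate.Male = 26.52,
                 Stomach.Female = 7.965, Stomach.Male = 14.79)
cancerMedians <- c(30.0, 20.30, 8.30, 10.30, 3.100, 6.10,
                   5.500, 19.50, 19.30, 6.300, 12.30)

# A mean/median ratio above 1 points to a right-skewed distribution
skewRatio <- round(cancerMeans / cancerMedians, 2)
sort(skewRatio, decreasing = TRUE)
```

Every ratio is above 1, and the liver cancers show the largest gap between mean and median, in line with their very high maxima.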

3.3. Cancer data integration

Next, I’ll integrate the number of cancer cases with the health and economic 2002 data.

# This time I need the 2002 data. First the economic parameters.
GDP2002 <- CountryGDP[c('Country', 'X2002')]
names(GDP2002) <- c('Country', 'GDP')
GDP2002$GDPClassification <- cut(GDP2002$GDP, 
                                 breaks = c(min(GDP2002$GDP), 
                                            quantile(GDP2002$GDP, 0.25), 
                                            quantile(GDP2002$GDP, 0.75), 
                                            max(GDP2002$GDP)),
                                 labels = c('Poor', 'Middle', 'Rich'),
                                 # Keep the minimum-GDP country in 'Poor'
                                 include.lowest = TRUE)

# And now the health data. I separate men and women because some cancer types
# are gender-specific.
# First, subset the BMI & Cholesterol levels
maleHealthData2002 <- subset(males(healthData), Year == 2002)
# Add the economic data
maleHealthData2002 <- merge(maleHealthData2002, GDP2002, all = FALSE)
# Add the cancer incidence data
maleHealthData2002 <- merge(maleHealthData2002, 
                            Cancer2002[c('Country', 'Prostate.Male', 
                                         'Colon.Male', 'Liver.Male', 
                                         'Lung.Male', 'Stomach.Male')], 
                            all = FALSE)
# Change the column names so it matches with the female data
names(maleHealthData2002) <- c('Country', 'Gender', 'Year', 'BMI', 
   'BMIClassification', 'CountryCode', 'Cholesterol', 'GDP', 
   'GDPClassification', 'Prostate', 'Colon', 'Liver', 'Lung', 'Stomach')
# Add the Breast and Cervix with NA
maleHealthData2002$Cervix <- NA
maleHealthData2002$Breast <- NA

# The same for female data
femaleHealthData2002 <- subset(females(healthData), Year == 2002)
femaleHealthData2002 <- merge(femaleHealthData2002, GDP2002, all = FALSE)
femaleHealthData2002 <- merge(femaleHealthData2002, 
                              Cancer2002[c('Country', 'Breast.Cancer', 'Cervix',
                                           'Colon.Female', 'Liver.Female', 
                                           'Lung.Female', 'Stomach.Female')], 
                              all = FALSE)

names(femaleHealthData2002) <- c('Country', 'Gender', 'Year', 'BMI', 
   'BMIClassification', 'CountryCode', 'Cholesterol', 'GDP', 
   'GDPClassification', 'Breast', 'Cervix', 'Colon', 'Liver', 'Lung', 'Stomach')
femaleHealthData2002$Prostate <- NA

# And join the two datasets (rbind matches data frame columns by name)
healthData2002 <- rbind(maleHealthData2002, femaleHealthData2002)

3.4. Non-gastrointestinal cancers

There are 3 gender-specific, non-gastrointestinal cancers in the dataset: breast, cervix, and prostate cancer. I’ll analyze them first because they should be less correlated with nutritional data than gastrointestinal cancers.

3.4.1. Prostate Cancer

Prostate cancer only affects males, so it should be easier to work with. As reasoned above, I’ll take the country GDP into account, because more cancers are expected to be detected in richer countries thanks to better health systems.

There actually seems to be some correlation between BMI and cholesterol levels and prostate cancer incidence, which I didn’t expect.

A quick search supports these findings:

Epidemiological studies have associated high blood-cholesterol levels with an increased risk of Prostate Cancer.

Source

3.4.2. Breast and Cervix Cancers.

Breast and cervix cancers only affect women. Let’s see how they correlate with BMI and cholesterol levels.

It appears that cholesterol levels are associated with breast cancer incidence, but BMI is not.

We can see that cholesterol is highly correlated with breast cancer incidence. And again, it seems that we are supported by science.

High cholesterol levels may increase a woman’s risk of developing breast cancer, a large new British study reports.

Source

In contrast, cervix cancer does not appear to be affected by these parameters.

3.4.3. Lung Cancer

The last non-gastrointestinal cancer for which we have incidence data is lung cancer. Lung cancer affects both men and women, so we can study both groups in this analysis.

Again, cholesterol levels are correlated with higher lung cancer incidence in male and female populations, but especially in men.

Let’s model this correlation in “non-poor” countries.
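The model itself is not shown here. As a minimal sketch, assuming a simple linear model of lung cancer incidence on cholesterol (the linear form is an assumption), `lm()` gives the slope and the share of variance explained. The `toy` values are invented.

```r
# Toy data: cholesterol level and lung cancer incidence (invented values)
toy <- data.frame(Cholesterol = c(4.4, 4.6, 4.8, 5.0, 5.2, 5.4),
                  Lung        = c(12, 20, 24, 35, 38, 50))

# Linear model: incidence as a function of cholesterol
fit <- lm(Lung ~ Cholesterol, data = toy)

# Estimated change in incidence per unit of cholesterol, and fit quality
slope    <- coef(fit)[['Cholesterol']]
rSquared <- summary(fit)$r.squared
c(slope = slope, rSquared = rSquared)
```

With the real data, the same call would be applied to the non-poor subset, optionally once per gender.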

Again, it appears that we are somewhat backed by science. In a Hawaii study, it was found that:

The results showed a significant positive association of dietary cholesterol and the risk of lung cancer in men, but not in women.

Source

3.5. Gastrointestinal cancers

It’s now time to analyze the incidence of gastrointestinal-related cancers. It can be hypothesized that gastrointestinal-related cancers such as Colon, Stomach or Liver cancers may have more relationship with dietary and nutritional parameters.

To speed up the analysis of those cancers, I’m going to use a for loop to make a scatterplot of each of them, and then focus on the most interesting ones.

gastrointestinalCancers <- c('Stomach', 'Colon', 'Liver')

for (cancer in gastrointestinalCancers) {
  # Subset once per iteration instead of once per aesthetic
  plotData <- nonPoor(healthData2002)

  # NOTE: the .data pronoun (ggplot2 >= 3.0.0) looks a column up by its name
  # given as a string, so the x axis could be parameterized the same way in an
  # inner loop over the parameters.
  bothBMI <- ggplot(plotData, aes(x = BMI, y = .data[[cancer]])) +
      geom_point() +
      xlab('BMI') +
      ylab(cancer)
  sexBMI <- bothBMI + facet_wrap(~Gender, ncol = 2)
  
  bothCholesterol <- ggplot(plotData, 
                            aes(x = Cholesterol, y = .data[[cancer]])) +
      geom_point() +
      xlab('Cholesterol') +
      ylab(cancer)
  sexCholesterol <- bothCholesterol + facet_wrap(~Gender, ncol = 2)
  
  grid.arrange(bothBMI, bothCholesterol, sexBMI, sexCholesterol, ncol = 2)
}

There seems to be a trend in the Colon cancers with our nutritional parameters. I also notice some outliers in the liver cancer plots, but it doesn’t appear to mask any trend.

Let’s focus on the colon cancer correlation.

So although BMI does not appear to be related to the number of gastrointestinal cancers, high blood cholesterol levels are, at least for colon cancer.

3.6. Summary Correlation Cholesterol and Cancer Incidence

As a summary, let’s plot the scatterplots in which we found a correlation with cholesterol.

3.7. BMI, Cholesterol & General Cancer Incidence

Finally, let’s take a step back and study the simple relationship between BMI and cholesterol and general cancer incidence.

3.7.1. Data reshape

To do so, I’ll convert the data from wide to long format, and add two new columns, in which I round the BMI and cholesterol levels to certain decimals to easily group the data and plot histograms.

# Remove the columns we don't want
healthData2002Long <- nonPoor(healthData2002)[-c(3,5,6,8,9)]
# Convert the data to long format
healthData2002Long <- melt(healthData2002Long,
                        id.vars = c('Country', 'Gender', 'BMI', 'Cholesterol'),
                        variable.name = 'TypeOfCancer',
                        value.name = 'NumberOfCases')
# Round BMI to the nearest 0.5
healthData2002Long$BMIRounded <- round(healthData2002Long$BMI*2,0)/2
# Round Cholesterol to the nearest 0.1
healthData2002Long$CholesterolRounded <- round(healthData2002Long$Cholesterol,1)
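The rounding trick used for BMI works by doubling the value, rounding to the nearest integer, and halving again, which snaps every value to a multiple of 0.5. A small worked example with invented values:

```r
# Invented BMI values
toyBMI <- c(24.2, 24.3, 24.76)

# 24.2 -> 48.4 -> 48 -> 24.0; 24.3 -> 48.6 -> 49 -> 24.5; 24.76 -> 49.52 -> 50 -> 25.0
bmiRounded <- round(toyBMI * 2, 0) / 2
bmiRounded

# Cholesterol is rounded to one decimal place directly
toyChol <- c(4.64, 4.66)
cholRounded <- round(toyChol, 1)
cholRounded
```

The same pattern generalizes to any step size: multiply by `1 / step`, round, and divide again.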

3.7.2. Cancer Incidence Distribution

Let’s plot the data to view the distribution.

3.7.2. Cancer Incidence Normalization

That’s nice, but we have to normalize the data, taking into account the number of cases for each BMI and cholesterol level occurrence. I’ll use dplyr to group the data. It is important that we ignore the NAs caused by gender-specficic cancers here.

BMICancerSummary <- healthData2002Long %>%
  group_by(BMIRounded, Gender, TypeOfCancer) %>%
  summarize(meanNumberOfCancers = mean(NumberOfCases, na.rm = TRUE),
            n = n()) %>%
  arrange(BMIRounded)

CholCancerSummary <- healthData2002Long %>%
  group_by(CholesterolRounded, Gender, TypeOfCancer) %>%
  summarize(meanNumberOfCancers = mean(NumberOfCases, na.rm = TRUE),
            n = n()) %>%
  arrange(CholesterolRounded)

# As an additional step, I'll change all the NaN values to 0.
BMICancerSummary$meanNumberOfCancers[
  is.nan(BMICancerSummary$meanNumberOfCancers)] <- 0
CholCancerSummary$meanNumberOfCancers[
  is.nan(CholCancerSummary$meanNumberOfCancers)] <- 0
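To see where the NaN values come from, here is a small base R sketch with toy numbers (hypothetical, not from the dataset). When every value in a group is NA, as happens for a gender-specific cancer in the other gender, mean(..., na.rm = TRUE) has nothing left to average and returns NaN.

```r
# Toy long-format data (hypothetical values): a gender-specific cancer has
# no observations for the other gender, so those rows are NA
toyLong <- data.frame(
  Gender        = c("female", "female", "male", "male"),
  NumberOfCases = c(NA, NA, 30, 50)
)

# Mean per gender, ignoring NAs (a base R stand-in for the grouped summarize)
rawMeans <- sapply(split(toyLong$NumberOfCases, toyLong$Gender),
                   mean, na.rm = TRUE)
print(rawMeans)   # the all-NA group yields NaN

# Replace NaN with 0, mirroring the additional step above
cleanMeans <- rawMeans
cleanMeans[is.nan(cleanMeans)] <- 0
print(cleanMeans)
```

One caveat of this choice: after the replacement, a group with no data is indistinguishable from a group with a true mean of zero, which is worth keeping in mind when reading the plots.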

Let’s plot the normalized data.

That makes more sense. It is interesting how the mean number of cases increases with cholesterol levels. However, here the data for the male and female populations are grouped together, and the plot includes cancers that are specific to each gender. Let’s look at each gender in more detail.

Apart from the number of cancers per parameter, I’d like to see the composition of these cancers, to check whether some types of cancer are more frequent at particular BMI and cholesterol levels, independently of the total cancer incidence.
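One way to express this composition is as the share of each cancer type within every cholesterol level. The following is a minimal base R sketch under that assumption; the toy summary table and its values are hypothetical and only reuse the column names from the summaries above.

```r
# Toy summary table (hypothetical values), one row per level and cancer type
toySummary <- data.frame(
  CholesterolRounded  = c(4.5, 4.5, 5.5, 5.5),
  TypeOfCancer        = c("Colon", "Stomach", "Colon", "Stomach"),
  meanNumberOfCancers = c(10, 30, 30, 30)
)

# Share of each cancer type within its cholesterol level:
# ave() applies the function per group and keeps the original row order
toySummary$share <- ave(toySummary$meanNumberOfCancers,
                        toySummary$CholesterolRounded,
                        FUN = function(x) x / sum(x))
print(toySummary)
```

Because the shares sum to 1 within each level, they can be compared across levels even when the total number of cases differs.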

3.7.3. Male Cancer Incidence

These graphs give an overall perspective on the relationship between cancer and BMI and cholesterol levels in male populations. We can clearly see how, as with the combined populations, cancer incidence rises at higher cholesterol levels, and how the share of each cancer type changes along the way. For example, colon cancer incidence is clearly higher at higher cholesterol levels, while the percentage of prostate cancer relative to the other cancer types does not depend on the cholesterol level.

3.7.4. Female population

The results for the female population are very similar to those obtained for the male population. It is interesting to see how the percentage of breast cancer relative to the other cancer types rises at higher cholesterol levels, while the percentage of stomach cancer diminishes.

3.8. Cholesterol and Common Cancers Types in Men and Women

To finish this data exploration study, I’d like to visualize how the cancer types common to both men and women change with blood cholesterol levels, as this is the most striking relationship that I found in this study. To do so, I’ll group the data once again, this time without separating by gender.

AllCholCancerSummary <- subset(CholCancerSummary,  TypeOfCancer != 'Cervix' & 
                       TypeOfCancer != 'Breast' & TypeOfCancer != 'Prostate')
AllCholCancerSummary <-  AllCholCancerSummary %>%
  group_by(CholesterolRounded, TypeOfCancer) %>%
  summarize(meanNumberOfCancers = mean(meanNumberOfCancers),
            n = n()) %>%
  arrange(CholesterolRounded)

And plot the data.

I think this plot summarizes well the relationship that I was searching for between nutritional parameters and cancer incidence. My hypothesis was that I would find this relationship with BMI, as it’s the standard indicator for obesity, but I finally found that blood cholesterol levels are a better predictor for this data.

4. Final Plots and Summary

In this study I wanted to analyze whether the worldwide increase in obesity was related to a worsened diet, and what effect this would have on cancer incidence at the population level.

Obesity is increasing worldwide.

This scatterplot provides an interesting reflection on the evolution of the BMI worldwide.

As a reminder, the current method for determining obesity classifies an individual as underweight if their BMI is below 18.5. Healthy individuals are expected to have a BMI between 18.5 and 25, while overweight individuals have a BMI between 25 and 30. Finally, an individual with a BMI greater than 30 is classified as obese.
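These cut-offs translate directly into code. Below is a minimal sketch with a hypothetical helper, classifyBMI, built on base R’s cut(). It assumes that values falling exactly on a boundary (18.5, 25 or 30) belong to the higher category, which the description above leaves open.

```r
# Hypothetical helper: map BMI values to the four categories.
# Assumption: boundary values (18.5, 25, 30) go to the HIGHER category,
# i.e. the intervals are [18.5, 25), [25, 30) and [30, Inf).
classifyBMI <- function(bmi) {
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("Underweight", "Healthy", "Overweight", "Obese"),
      right  = FALSE)
}

toyBMI <- c(17, 18.5, 24.9, 25, 29.9, 30, 34)   # toy values
bmiCategory <- classifyBMI(toyBMI)
print(data.frame(BMI = toyBMI, Category = bmiCategory))
```

Applied to country averages, such a helper labels the average individual of a population, not the proportion of people in each category.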

Although some populations could be classified as underweight in the early 1980s, no population average is currently at this level. What is worrying, however, is that the 2008 BMI estimate for more than half of the countries’ populations was greater than 25, that is, at an overweight level. Moreover, the populations of some Pacific countries have BMI estimates greater than 30, which suggests that a large share of the population in these countries is obese.

The health implications of this obesity epidemic must be (as they are) a medical priority.

Cholesterol blood levels and BMI are not correlated in worldwide populations.

I then wanted to study whether I could relate this BMI increase to a bad diet. Although diet is difficult to measure directly, some indirect indicators of diet type, such as cholesterol levels, may be interesting to analyze. Fast food and poor diets increase blood cholesterol levels, so I thought that there might be a correlation between cholesterol levels and BMI.

One important consideration when analyzing this country data was the development level of each country, because of the nutritional differences between “westernized” and “non-westernized” countries (see 2.5.1). To take this into account, I ignored the data from the bottom 25% “poorest” countries as determined by their GDP (PPP).
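A filter of this kind can be sketched in a few lines of base R. The helper below, dropPoorestQuartile, and the GDP column name are hypothetical and only illustrate the idea; it is not the function used earlier in this study, and the toy values are made up.

```r
# Hypothetical helper: keep only countries above the first GDP quartile.
# Rows with a missing GDP value are dropped as well (an assumption).
dropPoorestQuartile <- function(df, gdpColumn = "GDP") {
  threshold <- quantile(df[[gdpColumn]], probs = 0.25, na.rm = TRUE)
  keep <- !is.na(df[[gdpColumn]]) & df[[gdpColumn]] > threshold
  df[keep, , drop = FALSE]
}

# Toy data: eight made-up countries with made-up GDP (PPP) values
toyCountries <- data.frame(
  Country = paste0("Country", 1:8),
  GDP     = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000)
)

richer <- dropPoorestQuartile(toyCountries)
print(richer$Country)
```

Whether countries exactly at the threshold are kept or dropped, and how missing GDP values are handled, are choices that should be stated explicitly, since they change which countries enter the analysis.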

However, I did not find any meaningful relationship between BMI and blood cholesterol levels at a population level.

High Cholesterol blood levels are correlated with higher cancer cases.

Finally, I analyzed the relationship between population BMI and cholesterol levels and the incidence of different types of cancer. Although I did not find any particular relationship between BMI and cancer incidence, as I was expecting, I did find one with blood cholesterol levels.

Higher blood cholesterol levels correlated with a higher incidence of different types of cancer in both male and female populations (see 3.6). As blood cholesterol levels are closely related to diet, my initial guess was that I would find more meaningful correlations between those levels and gastrointestinal cancers, but I only found this to be true for colon cancer.

5. Reflection

The datasets used in this study contain information on different parameters from about 200 countries. Sometimes I struggled to keep the analysis tidy because the data for the male and female populations were stored separately. This enabled a finer exploration of the parameters, but also complicated a lot of otherwise simple processes. I started by exploring the principal parameter that I wanted to study, the population BMI, and could create a linear model for its worldwide increase.

Eventually I integrated this data with the only diet-related parameter I could find in the Gapminder database: total blood cholesterol levels. As I was analyzing this relationship, I realized that I should take into account the differences in development between countries, so I also integrated data on each country’s GDP (PPP). Unfortunately, I did not find the relationship I was expecting between BMI and cholesterol levels, and I could not point to a clear effect of poor diet on the worldwide increase in overweight and obesity.

Nevertheless, I continued my exploration towards the main question: are those changes in obesity affecting cancer incidence? I decided to analyze the cholesterol data too because it was already available, and found in this variable the correlation that I had been looking for in the BMI but had not found. This relationship was especially clear for colon cancer, for which I drew a linear model of the correlation. I struggled with the normalization of the data, as I had to account for the number of observations at each level in order to depict a correct ratio of cancer incidence to each parameter.

Finally, I’d like to talk a little about the data and the conclusions that can be drawn from this study. I was interested in analyzing data regarding metabolism and health because studying the relationship between the metabolic state and inflammation was an important part of my PhD. The data used here is at the epidemiological level because individual-level data is protected and requires special permissions to work with. However, individual data would be better suited to drawing strong conclusions about the factors studied here. Moreover, we do not have data for all types of cancer, and the time points are not complete. In addition, analyzing a single time point is probably not correct, as the timescale over which cancer develops as a result of the variables studied here should be taken into account. Finally, BMI is not a perfect measure, and more specific parameters may be better suited for this type of study.